K-Means Clustering

Stacy Chandisingh, Leo Pena, Shaif Hossain

11/20/23

Introduction: What is K-Means Clustering?

  • An unsupervised machine learning technique
  • K-means partitions a dataset into a fixed number (k) of clusters
  • Used to draw conclusions about a dataset based on groups of similar observations
  • There are several ways to choose k; we use the elbow method
  • We used a Spotify dataset of track attributes (examples: tempo and popularity) to cluster tracks with similar features together
  • We want to enhance the listener experience by recommending popular songs whose attributes are similar to those of a listener's favorite songs.

Methods: diving into the K-Means algorithm

  • K-means assumes the clusters are roughly symmetric (spherical); real data can contain asymmetrical clusters, which it separates poorly.
  • The data is assumed to be independent.
  • Properly scaled data is important for K-Means; unscaled variables can lead to uneven clusters.
  • There is a fixed number of clusters for K-Means.
  • Descriptive statistics, histograms, and a correlation matrix are used to visualize the spread of the data.

  • Preprocess and scale data for ease of comparability.

  • Libraries used: tidyverse, cluster, and factoextra. 

  • Euclidean distances between observations are calculated as the distance measure for clustering.

  • The distance data is visualized using the fviz_dist() function in R from the factoextra package. 
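The scaling and distance steps above can be illustrated with a minimal sketch. It is written in Python with only the standard library, not the R code used for the analysis, and the three toy rows are hypothetical values rather than rows from the Spotify data. The standardization uses the sample standard deviation (n - 1 denominator), which is also what R's scale() does by default.

```python
import math
import statistics

# Hypothetical toy rows: (tempo in BPM, popularity score). Not from the Spotify data.
tracks = [
    (120.0, 60.0),
    (90.0, 40.0),
    (150.0, 80.0),
]

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # n - 1 denominator
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]

def distance_matrix(rows):
    """Pairwise Euclidean distances between all rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]

scaled = standardize(tracks)
distances = distance_matrix(scaled)
```

After scaling, tempo and popularity contribute equally to each distance, even though their raw units differ by a wide margin.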

\[WCSS = \sum_{i=1}^{K}\sum_{j=1}^{n_i}\left \| x_{ij} - c_{i} \right \|^2\]

  • K is the number of clusters.
  • n_i is the number of data points in cluster i.
  • c_i is the centroid of cluster i.
  • x_ij is the jth data point in cluster i.
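As a worked instance of the WCSS formula, here is a small sketch in Python (standard library only; not the R code from the analysis). The two toy clusters are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def centroid(points):
    """Mean of the points, attribute by attribute (c_i in the formula)."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def wcss(clusters):
    """Sum over clusters of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for x in points:
            total += sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return total

# Hypothetical toy clusters (K = 2).
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],              # centroid (1, 2): contributes 1 + 1 = 2
    [(5.0, 5.0), (7.0, 5.0), (6.0, 8.0)],  # centroid (6, 6): contributes 2 + 2 + 4 = 8
]

total = wcss(clusters)  # 2 + 8 = 10
```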

The WCSS (within-cluster sum of squares) measures how tightly the data points within each cluster are grouped; lower values mean more compact clusters. The variables selected from the dataset are scaled first. After scaling, the next step is:

The Euclidean distance is calculated between each data point and each cluster center: \[ D_{euclidean}(x, k_{i}) = \sqrt{\sum_{j=1}^{n}(x_{j} - k_{ij})^{2}} \]

Where:

  • x is a data point.
  • k_i is the center (centroid) of cluster i.
  • n is the number of attributes.
  • x_j is the jth attribute of the data point.
  • k_ij is the jth attribute of the centroid of cluster i.
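To show how this distance drives the clustering, here is a minimal Lloyd-style k-means sketch in Python (standard library only). It is an illustration under stated assumptions, not the analysis code: the starting centers are passed in explicitly so the run is repeatable, the toy points are hypothetical, and R's kmeans() uses the Hartigan-Wong algorithm by default, so its internals differ from this loop.

```python
import math

def assign(points, centers):
    """Index of the nearest center (by Euclidean distance) for each point."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
        for x in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [x for x, lab in zip(points, labels) if lab == i]
        if not members:  # an empty cluster keeps its previous center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(centers)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return assign(points, centers), centers

# Hypothetical toy data: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (9.0, 9.0)])
```

Each pass can only lower or keep the WCSS, which is why the loop settles on stable centers.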

Analysis and Results: Data and Visualization

Data source: Spotify data from Kaggle


  • Covariance is a statistical measure of how two variables vary together.
  • Positive numbers indicate a positive relationship, such as the positive value for loudness and energy, telling us that as the loudness of a track increases, so does its energy.
  • Negative numbers indicate an inverse relationship, such as the value for loudness and acousticness: as one increases, the other tends to decrease.

covariance

corrplot

In the correlation plot, the darker the blue the greater the correlation between the variables. The chart shows a positive correlation between energy and loudness and a negative correlation between acousticness and energy.
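The difference between covariance and correlation can be made concrete with a short sketch (Python, standard library only; the attribute values below are hypothetical, not taken from the dataset). Covariance keeps the units of the variables, while correlation rescales it to the range -1 to 1, which is what the correlation plot displays.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical values for four tracks.
loudness = [-12.0, -9.0, -6.0, -3.0]
energy = [0.2, 0.4, 0.6, 0.8]
acousticness = [0.9, 0.7, 0.4, 0.2]

cov_loud_energy = covariance(loudness, energy)          # positive: they rise together
cov_loud_acoustic = covariance(loudness, acousticness)  # negative: one rises, the other falls
cor_loud_energy = correlation(loudness, energy)
cor_loud_acoustic = correlation(loudness, acousticness)
```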

  • Histogram of variables to assess distribution and normality

histogram

Visualizing the data clusters

  • The point at which the line starts to bend is where we read off our “k” clusters, hence the name “Elbow Curve”.
  • The plot shows a bend (the “elbow”) because increasing k always lowers the within-cluster sum of squares, but past a certain k the improvement becomes small. This bend indicates the ideal number of clusters for our model.
  • In our plot, we see that 3 is the ideal number of clusters, meaning there are 3 centroids, and each data point is assigned to the centroid it is closest to.

elbow
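The elbow method can be sketched as follows (Python, standard library only; an illustration, not the code behind the plot). Two assumptions are made for repeatability: the toy points are hypothetical, forming three tight groups, and the first k points are used as starting centers instead of a random start.

```python
import math

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic k-means and return the final WCSS.

    Assumption: the first k points serve as starting centers, so runs are repeatable.
    """
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda i: math.dist(x, centers[i])) for x in points
        ]
        new_centers = []
        for i in range(k):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:  # an empty cluster keeps its previous center
                new_centers.append(centers[i])
        if new_centers == centers:
            break
        centers = new_centers
    # Squared distance from each point to its nearest final center.
    return sum(min(math.dist(x, c) ** 2 for c in centers) for x in points)

# Hypothetical toy data: three tight groups, interleaved so the first k points
# come from different groups.
points = [
    (0.0, 0.0), (10.0, 0.0), (3.0, 9.0),
    (0.0, 1.0), (10.0, 1.0), (3.0, 10.0),
    (1.0, 0.0), (11.0, 0.0), (4.0, 9.0),
]

wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

On this toy data the WCSS falls sharply up to k = 3 and barely changes at k = 4, which is the bend the elbow plot is read for.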

  • Three distinct clusters are shown here.

clusters

What did we find?

  • In cluster 1, there are 2,292 records.

  • In cluster 2, there are 1,263 records.

  • In cluster 3, there are 3,010 records.

Centroid Positions:

centroid

  • We assess cluster separation with the ratio of the between-cluster sum of squares (BSS) to the total sum of squares (TSS).

  • This ratio ranges from 0 to 1 and shows how much of the total variation is explained by the separation between clusters.

  • The closer to 1, the more distinct the clusters are within the dataset.

  • In our model, the BSS/TSS ratio is approximately 0.204, which is a fairly low ratio for this type of model. However, we determined that a small number of clusters was sufficient for this model, and using few clusters also tends to produce a low BSS/TSS ratio.

BSS/TSS Ratio: 0.2039852
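The ratio can be reproduced by hand from cluster assignments, as in this minimal sketch (Python, standard library only; the toy clusters are hypothetical). It relies on the identity TSS = BSS + WSS, where WSS is the total within-cluster sum of squares. In R, the object returned by kmeans() exposes the same quantities as betweenss, totss, and tot.withinss.

```python
def mean_point(points):
    """Attribute-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sum_sq(points, center):
    """Sum of squared Euclidean distances from each point to a center."""
    return sum(sum((xj - cj) ** 2 for xj, cj in zip(x, center)) for x in points)

# Hypothetical toy clusters.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(5.0, 5.0), (7.0, 5.0), (6.0, 8.0)],
]

all_points = [p for cluster in clusters for p in cluster]
grand_mean = mean_point(all_points)

tss = sum_sq(all_points, grand_mean)                    # total sum of squares
wss = sum(sum_sq(c, mean_point(c)) for c in clusters)   # within-cluster part
# Between-cluster part: cluster size times squared distance of its centroid
# from the grand mean.
bss = sum(len(c) * sum_sq([mean_point(c)], grand_mean) for c in clusters)

ratio = bss / tss
```

Here the ratio is about 0.83 because the two toy clusters are far apart relative to their spread; overlapping clusters push the ratio toward 0.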

min_max

  • This table shows which cluster has the minimum and maximum of each attribute.
  • This table briefly summarizes the types of songs we have in each cluster.

Popularity Assessment

popularity

We would like to see the number of popular songs in each cluster. As in our previous analysis, we take our Spotify dataset and count the popular tracks in each cluster, where a track is popular if its popularity score is greater than 57.

  • In cluster 1, there are 492 popular tracks.
  • In cluster 2, there are 405 popular tracks.
  • In cluster 3, there are 1,241 popular tracks.

The graph compares the clusters, using the new only_pop data to show what types of popular tracks are in each one.
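The counting step can be sketched like this (Python, standard library only). The threshold of 57 comes from the text; the (cluster, popularity) pairs are hypothetical and only illustrate the filter. Reporting the share of popular tracks alongside the raw count helps when clusters differ in size.

```python
from collections import Counter

POPULARITY_THRESHOLD = 57  # from the text: popular means a score greater than 57

# Hypothetical (cluster label, popularity score) pairs, not rows from the dataset.
tracks = [(1, 80), (1, 30), (2, 58), (2, 57), (3, 90), (3, 61), (3, 12)]

# Count tracks per cluster, then only those above the threshold.
size_per_cluster = Counter(cluster for cluster, _ in tracks)
popular_per_cluster = Counter(
    cluster for cluster, popularity in tracks if popularity > POPULARITY_THRESHOLD
)

# Share of each cluster that is popular.
popular_share = {
    cluster: popular_per_cluster[cluster] / size_per_cluster[cluster]
    for cluster in sorted(size_per_cluster)
}
```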

1 cluster1

In our K-Means clustering model, cluster 1 has the highest duration_min, energy, instrumentalness, liveness, loudness, and tempo. These are fast and upbeat songs that include Blinding Lights, Crazy Train, and Sweet Child O' Mine. The majority of genres are pop, rock, and EDM, which make up our most popular groups. These songs are low on acousticness, danceability, speechiness, and valence. This cluster also has the lowest number of popular tracks.

2 cluster2

In our K-Means clustering model, cluster 2 has the highest danceability, speechiness, and valence, and the highest number of popular songs among our clusters. These are cheerful and vocal songs that include Memories, Falling, and everything I wanted as its top songs. The majority of genres are pop, latin, and rock. These songs are low on duration_min and instrumentalness.

3 cluster3

Cluster 3

In our K-Means clustering model, cluster 3 has the highest acousticness. These are soft rock or acoustic songs that include Roxanne, The Box, and Circles. These songs are low on energy, liveness, loudness, and tempo. This cluster is very close to having the lowest number of popular tracks, and the top genres are latin, rap, and pop.

Conclusions

  • K-Means provides a simple yet effective way to draw insights from data.

  • Unsupervised machine learning finds hidden structure in an unlabeled dataset; its aim is to group data points by their similarities.

  • Three cluster groups were found.

  • We can offer song recommendations based on this analysis

  • We recommend using other K-Means variants or other clustering methods to verify whether the results are similar.

References